weather <- 
  read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv", 
           skip = 1, 
           na = "***")
tidyweather <- weather %>%
  select(1:13) %>% #Selecting the Year and month variables
  pivot_longer(cols=2:13, names_to='month', values_to='delta') #Tidying the data from wide to long format, giving one column for the month and one for the corresponding temperature anomaly
skim(tidyweather)
| Name | tidyweather |
| Number of rows | 1704 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| month | 0 | 1 | 3 | 3 | 0 | 12 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Year | 0 | 1 | 1950.50 | 41.00 | 1880.00 | 1915.00 | 1950 | 1986.00 | 2021.00 | ▇▇▇▇▇ |
| delta | 5 | 1 | 0.08 | 0.47 | -1.52 | -0.24 | 0 | 0.31 | 1.94 | ▁▆▇▂▁ |
#Creating date variables for the tidyweather dataset
tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), month, "1")), #Creating a column called date
         month = month(date, label=TRUE), #Converting the month column into an ordered factor
         year = year(date)) #Extracting the year from date
#Plotting temperature by date
ggplot(tidyweather, aes(x=date, y = delta))+ #Plotting delta by date
geom_point()+ #Scatterplot
geom_smooth(color="red") + #Adding a red trend line
theme_bw() + #theme
  labs( #Adding labels
title = "Weather Anomalies",
x = "Date",
y = "Delta"
) +
  NULL
tidyweather %>%
ggplot(aes(x=Year, y=delta)) + #Plotting delta by Year
geom_point() + #Scatterplot
geom_smooth(color="red") + #Adding a red trend line
theme_bw() + #theme
facet_wrap(~month) + #Creating separate graphs for each month
  labs( #Adding labels
title = "Weather Anomalies per Month",
x = "Year",
y = "Delta"
) +
  NULL
Answer below
Although all of the graphs in the grid show a similar upward trend, there are subtle differences in variability between months such as December/January and June/July: January shows much higher variability in temperature anomalies than June does. This may be worth looking into for meteorologists.
comparison <- tidyweather %>% #New data frame called comparison
filter(Year>= 1881) %>% #remove years prior to 1881
#create new variable 'interval', and assign values based on criteria below:
mutate(interval = case_when(
Year %in% c(1881:1920) ~ "1881-1920",
Year %in% c(1921:1950) ~ "1921-1950",
Year %in% c(1951:1980) ~ "1951-1980",
Year %in% c(1981:2010) ~ "1981-2010",
TRUE ~ "2011-present"
  ))
#Density plot of delta, grouped by the intervals we are interested in
ggplot(comparison, aes(x=delta, fill=interval)) +
  geom_density(alpha=0.2) + #density plot with transparency set to 20%
theme_bw() + #theme
labs (
title = "Density Plot for Monthly Temperature Anomalies",
y = "Density" #changing y-axis label to sentence case
  )
average_annual_anomaly <- tidyweather %>%
filter(!is.na(delta)) %>% #Removing rows with NA's in the delta column
group_by(Year) %>%
summarise(
annual_average_delta=mean(delta)) #New column annual_average_delta to calculate the mean delta by year
ggplot(average_annual_anomaly, aes(x=Year, y=annual_average_delta))+
geom_point() + #Scatterplot of annual_average_delta over the years
geom_smooth() + #Trend line
theme_bw() + #Theme
labs (
title = "Average Yearly Anomaly", #Title
y = "Average Annual Delta" #y-axis label
) +
  NULL
NASA points out on their website that
A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.
Your task is to construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation with the infer package. Recall that the dataframe comparison has already grouped temperature anomalies according to time intervals; we are only interested in what is happening between 2011-present.
formula_ci <- comparison %>%
group_by(interval) %>%
# calculate mean, SD, count, SE, lower/upper 95% CI
summarise(
mean=mean(delta, na.rm=T), #mean
sd=sd(delta, na.rm=T), #standard deviation
count=n(), #number of datapoints
se=sd/sqrt(count), #standard error
t_critical=qt(0.975, count-1), #t-critical using quantile function
lower=mean-t_critical*se, #lower end of CI
upper=mean+t_critical*se) %>% #upper end of CI
# choose the interval 2011-present
filter(interval == '2011-present')
formula_ci
## # A tibble: 1 × 8
## interval mean sd count se t_critical lower upper
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2011-present 1.06 0.276 132 0.0240 1.98 1.01 1.11
boot_ci <- comparison %>%
group_by(interval) %>%
filter(interval == '2011-present') %>%
specify(response=delta) %>% #Setting delta as the response variable
  generate(reps=1000, type='bootstrap') %>% #Generating 1000 bootstrap resamples
calculate(stat='mean') %>% #Calculating mean
get_confidence_interval(level=0.95, type='percentile') #Calculating confidence interval
boot_ci
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.01 1.11
Answer below
We construct a 95% confidence interval using both the formula and a bootstrap simulation, and the two methods agree closely. With 95% confidence, the true mean anomaly lies within the calculated interval. Since the interval does not contain zero, the mean anomaly since 2011 is statistically significantly greater than zero. Hence we can conclude that the increase in temperature is statistically significant and that global warming is progressing.
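As a quick sanity check that the formula approach is sound, the manual t-based interval can be compared with R's built-in `t.test()` on a small made-up vector (the numbers below are illustrative only, not the NASA anomalies):

```r
# Made-up anomaly values, for illustration only
x <- c(0.9, 1.1, 1.0, 1.2, 0.8, 1.1, 1.3, 0.9)

n          <- length(x)
se         <- sd(x) / sqrt(n)                    # standard error of the mean
t_critical <- qt(0.975, n - 1)                   # t critical value, df = n - 1
manual_ci  <- mean(x) + c(-1, 1) * t_critical * se

# t.test() computes exactly the same interval
built_in <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)
all.equal(manual_ci, built_in)  # TRUE
```

Since this toy interval also excludes zero, the same "does the CI contain zero?" check applies.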
# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv')
glimpse(approval_polllist)
## Rows: 1,597
## Columns: 22
## $ president <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate <chr> "9/10/2021", "9/10/2021", "9/10/2021", "9/10/2021"…
## $ startdate <chr> "1/24/2021", "1/24/2021", "1/25/2021", "1/25/2021"…
## $ enddate <chr> "1/26/2021", "1/27/2021", "1/27/2021", "1/26/2021"…
## $ pollster <chr> "Rasmussen Reports/Pulse Opinion Research", "Maris…
## $ grade <chr> "B", "A", "B", "B", "B", "B", "B", "B+", "B", "B",…
## $ samplesize <dbl> 1500, 1313, 1500, 2200, 15000, 9212, 1500, 906, 15…
## $ population <chr> "lv", "a", "lv", "a", "a", "a", "lv", "a", "a", "a…
## $ weight <dbl> 0.3225, 2.1893, 0.3025, 0.1205, 0.2739, 0.1520, 0.…
## $ influence <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve <dbl> 48.0, 49.0, 49.0, 58.0, 54.0, 55.0, 50.0, 57.0, 54…
## $ disapprove <dbl> 48.0, 35.0, 48.0, 32.0, 31.0, 31.5, 45.0, 37.0, 32…
## $ adjusted_approve <dbl> 50.4, 50.0, 51.4, 56.5, 52.5, 53.5, 52.4, 56.3, 52…
## $ adjusted_disapprove <dbl> 41.9, 34.6, 41.9, 35.4, 34.4, 34.9, 38.9, 36.4, 35…
## $ multiversions <chr> NA, NA, NA, NA, NA, "*", NA, NA, NA, NA, NA, NA, N…
## $ tracking <lgl> TRUE, NA, TRUE, NA, TRUE, TRUE, TRUE, NA, TRUE, NA…
## $ url <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id <dbl> 74261, 74320, 74268, 74346, 74277, 74292, 74290, 7…
## $ question_id <dbl> 139433, 139558, 139483, 139653, 139497, 139518, 13…
## $ createddate <chr> "1/27/2021", "2/1/2021", "1/28/2021", "2/5/2021", …
## $ timestamp <chr> "18:35:08 10 Sep 2021", "18:35:08 10 Sep 2021", "1…
# Use `lubridate` to fix dates, as they are given as characters.
approval_polllist <- approval_polllist %>%
mutate(
modeldate=mdy(modeldate),
startdate=mdy(startdate),
enddate=mdy(enddate),
createddate=mdy(createddate)
)
glimpse(approval_polllist)
## Rows: 1,597
## Columns: 22
## $ president <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate <date> 2021-09-10, 2021-09-10, 2021-09-10, 2021-09-10, 2…
## $ startdate <date> 2021-01-24, 2021-01-24, 2021-01-25, 2021-01-25, 2…
## $ enddate <date> 2021-01-26, 2021-01-27, 2021-01-27, 2021-01-26, 2…
## $ pollster <chr> "Rasmussen Reports/Pulse Opinion Research", "Maris…
## $ grade <chr> "B", "A", "B", "B", "B", "B", "B", "B+", "B", "B",…
## $ samplesize <dbl> 1500, 1313, 1500, 2200, 15000, 9212, 1500, 906, 15…
## $ population <chr> "lv", "a", "lv", "a", "a", "a", "lv", "a", "a", "a…
## $ weight <dbl> 0.3225, 2.1893, 0.3025, 0.1205, 0.2739, 0.1520, 0.…
## $ influence <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve <dbl> 48.0, 49.0, 49.0, 58.0, 54.0, 55.0, 50.0, 57.0, 54…
## $ disapprove <dbl> 48.0, 35.0, 48.0, 32.0, 31.0, 31.5, 45.0, 37.0, 32…
## $ adjusted_approve <dbl> 50.4, 50.0, 51.4, 56.5, 52.5, 53.5, 52.4, 56.3, 52…
## $ adjusted_disapprove <dbl> 41.9, 34.6, 41.9, 35.4, 34.4, 34.9, 38.9, 36.4, 35…
## $ multiversions <chr> NA, NA, NA, NA, NA, "*", NA, NA, NA, NA, NA, NA, N…
## $ tracking <lgl> TRUE, NA, TRUE, NA, TRUE, TRUE, TRUE, NA, TRUE, NA…
## $ url <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id <dbl> 74261, 74320, 74268, 74346, 74277, 74292, 74290, 7…
## $ question_id <dbl> 139433, 139558, 139483, 139653, 139497, 139518, 13…
## $ createddate <date> 2021-01-27, 2021-02-01, 2021-01-28, 2021-02-05, 2…
## $ timestamp <chr> "18:35:08 10 Sep 2021", "18:35:08 10 Sep 2021", "1…
knitr::include_graphics("../images/biden_approval_margin.png", error = FALSE)
plot <- approval_polllist %>%
mutate(week=week(enddate)) %>% #Creating a new column called week by extracting the week from the enddate variable
group_by(week) %>%
mutate(
net_approval_rate=approve-disapprove #Creating a new column called net_approval_rate by subtracting disapprove from approve
) %>%
summarise(
mean=mean(net_approval_rate), #Mean net approval by week
sd=sd(net_approval_rate), #Standard deviation of net approval by week
count=n(), #Count by week
    se=sd/sqrt(count), #Standard error by week
t_critical=qt(0.975, count-1), #T-critical value
lower=mean-t_critical*se, #Lower end of the CI
upper=mean+t_critical*se #Upper end of the CI
) %>%
#Scatterplot of the calculated net approval rate means by week
ggplot(aes(x=week, y=mean)) +
geom_point(colour='red') + #Scatterplot using red points
geom_line(colour='red', size=0.25) + #Adding a red line to connect the points
geom_ribbon(aes(ymin=lower, ymax=upper), colour='red', linetype=1, alpha=0.1, size=0.25) +
geom_smooth(se=F) + #Adding a smooth line for the trend
geom_hline(yintercept=0, color='orange', size=2) + #Adding an orange horizontal line
theme_bw() + #Theme
labs(title='Estimating Approval Margin (approve-disapprove) for Joe Biden', #Adding a title
subtitle='Weekly average of all polls', #Subtitle
x='Week of the year', #X-label
y='Average Approval Margin (Approve - Disapprove)') + #Y-label
NULL
ggsave(file='biden_plot.png', plot=plot, width=12, height=8) #Saving to adjust image width
knitr::include_graphics("biden_plot.png", error=F)
Answer below
The confidence interval for week 4 ranges from 9.14 to 19.68 with a mean of 14.41 and a standard deviation of 10.25, while the interval for week 25 ranges from 10.30 to 12.75 with a mean of 11.53 and a standard deviation of 4.74. The difference in width is mainly due to the number of data points: week 4 has only 17 polls, while week 25 has 60. With more data, we obtain a narrower interval at the same level of confidence.
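The effect of sample size on interval width can be sketched directly from the CI formula (the standard deviation of 10 below is a made-up value, held fixed so that only n changes):

```r
# Width of a 95% t-based CI: 2 * t_critical * s / sqrt(n)
ci_width <- function(s, n) 2 * qt(0.975, n - 1) * s / sqrt(n)

ci_width(10, 17)  # 17 polls, as in week 4  -> wider interval
ci_width(10, 60)  # 60 polls, as in week 25 -> narrower interval
```

Both the shrinking standard error (s/sqrt(n)) and the smaller t critical value at higher degrees of freedom narrow the interval as n grows.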
# load gapminder HIV data
hiv <- read_csv("../data/adults_with_hiv_percent_age_15_49.csv")
life_expectancy <- read_csv("../data/life_expectancy_years.csv")
# get World bank data using wbstats
indicators <- c("SP.DYN.TFRT.IN","SE.PRM.NENR", "SH.DYN.MORT", "NY.GDP.PCAP.KD")
library(wbstats)
worldbank_data <- wb_data(country="countries_only", #countries only- no aggregates like Latin America, Europe, etc.
indicator = indicators,
start_date = 1960,
end_date = 2016)
# get a dataframe of information regarding countries, indicators, sources, regions, indicator topics, lending types, income levels, from the World Bank API
countries <- wbstats::wb_cachelist$countries
#Tidying the gapminder HIV data
hiv_long <- hiv %>%
pivot_longer(cols=2:34, names_to='year', values_to='hiv') %>% #Move all years to a new column called 'year' and the values to a new column called 'hiv'
mutate(year=as.numeric(year)) #Read 'year' as number
#Tidying the life expectancy data
life_expectancy_long <- life_expectancy %>%
pivot_longer(cols=2:302, names_to='year', values_to='lifeExp') %>% #Move all years to a new column called 'year' and the values to a new column called 'lifeExp'
mutate(year=as.numeric(year)) #Read 'year' as number
#Tidying World bank data from wbstats
worldbank_data_pretty_much_long <- worldbank_data %>%
  select(-iso2c, -iso3c) %>% #Dropping the iso2c and iso3c columns
rename(year=date) #Rename date as year
data_join <- hiv_long %>%
inner_join(life_expectancy_long, by=c('country', 'year')) %>%#joining hiv_long with life_expectancy_long
full_join(worldbank_data_pretty_much_long, by=c('country', 'year')) #joining the new data set above with the worldbank_data_pretty_much_long data set
data_join
## # A tibble: 12,732 × 8
## country year hiv lifeExp NY.GDP.PCAP.KD SE.PRM.NENR SH.DYN.MORT
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 1979 NA 44.4 NA NA 248.
## 2 Afghanistan 1980 NA 44.1 NA NA 242.
## 3 Afghanistan 1981 NA 44.9 NA NA 235.
## 4 Afghanistan 1982 NA 44.6 NA NA 229.
## 5 Afghanistan 1983 NA 42.8 NA NA 222.
## 6 Afghanistan 1984 NA 40.5 NA NA 216.
## 7 Afghanistan 1985 NA 42.4 NA NA 209.
## 8 Afghanistan 1986 NA 43.4 NA NA 203.
## 9 Afghanistan 1987 NA 45.5 NA NA 196.
## 10 Afghanistan 1988 NA 47.9 NA NA 190.
## # … with 12,722 more rows, and 1 more variable: SP.DYN.TFRT.IN <dbl>
Answer below
The inner_join operation joins two data sets by matching common identifiers and drops all rows that do not match. full_join also matches on common identifiers, but keeps the rows that appear in only one of the two data sets. We use inner_join for hiv_long and life_expectancy_long because we need matched HIV and life-expectancy observations to create the graph of HIV prevalence against life expectancy shown below. We then use full_join to bring in the World Bank data, because the later questions on fertility rate and GDP per capita rely on columns that exist only in the World Bank data. Since we don't want to reduce the World Bank data to the coverage of the HIV and life-expectancy data, which span fewer countries and a shorter time frame, full_join is the right choice.
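The difference between the two joins can be sketched with two toy tables (the countries and values below are made up):

```r
library(dplyr)

a <- tibble(country = c("A", "B"), hiv = c(1, 2))
b <- tibble(country = c("B", "C"), gdp = c(10, 20))

inner_join(a, b, by = "country") # 1 row:  only "B" appears in both tables
full_join(a, b, by = "country")  # 3 rows: unmatched "A" and "C" kept, with NAs
```

In our case hiv_long and life_expectancy_long play the roles of `a` and `b` for the inner join, and the World Bank data is kept whole by the subsequent full join.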
data_join %>%
mutate(region=countrycode(country, origin='country.name', destination='region')) %>% #Extracting region from country name and creating a new column called 'region'
  filter(year >= 1970) %>% #Keep only years from 1970 onwards
mutate(
decadeStart=year%/%10*10,
interval=paste(decadeStart, '-', decadeStart+9)) %>% #Creating a new column called 'interval' for decades
select(-decadeStart) %>% #Deleting decadeStart column
ggplot(aes(x=hiv, y=lifeExp)) + #Creating a scatterplot for hiv and lifeExp
geom_point(alpha=0.25) + #Creating see through points
geom_smooth(se=F) + #Adding a smooth line
facet_wrap(~region, scales='free') + #Creating different graphs for every region
labs(title="Relationship between HIV prevalence and life expectancy by region", x="HIV prevalence", y="Life expectancy") +
  NULL
Answer below
Although the plots may look noisy, the data is concentrated toward the top-left corner, meaning that lower HIV prevalence is associated with higher life expectancy overall. This trend is strongest in regions such as Latin America & the Caribbean and the Middle East and Africa. In developed regions the trend is less obvious, and there is large variability in all regions due to confounding factors: people also die early from causes unrelated to HIV, such as car crashes and other diseases.
data_join %>%
mutate(region=countrycode(country, origin='country.name', destination='region')) %>% #Extracting region from country name and creating a new column called 'region'
ggplot(aes(x=SP.DYN.TFRT.IN, y=NY.GDP.PCAP.KD)) + #Scatterplot of fertility rate and GDP per capita
geom_point(alpha=0.25) + #Creating see through points
geom_smooth(se=F) + #Adding a smooth line
facet_wrap(~region, scales='free') + #Creating different graphs for every region
labs(title="Relationship between fertility rate and GDP per capita by region", x="Fertility rate", y="GDP per capita") +
  NULL
Answer below
We see a negative correlation between fertility rate and GDP per capita overall, meaning lower fertility is associated with higher GDP per capita. This relationship is strong in regions such as East Asia, which makes sense because East Asia has a mix of development levels and high variation in fertility rates (e.g. Japan has a low fertility rate and high GDP per capita, while the Philippines has a higher fertility rate and lower GDP per capita). The pattern is less pronounced in regions such as the Middle East and Africa, where most countries have high fertility and low GDP per capita.
hiv_long %>%
  filter(is.na(hiv)) %>% #Keep only rows with missing HIV data
mutate(region=countrycode(country, origin='country.name', destination='region23')) %>% #Extracting region from country name and creating a new column called 'region'
group_by(region) %>%
count() %>% #Count by region
ggplot() +
geom_col(aes(x=n, y=reorder(region, n))) + #Bar plot of count per region
labs(title="Missing HIV data", x="Count", y="Regions") +
  NULL
#Tidying data set
mortality <- worldbank_data_pretty_much_long %>%
filter(!is.na(SH.DYN.MORT)) %>% #Filtering out the NA's
select(-NY.GDP.PCAP.KD, -SE.PRM.NENR, -SP.DYN.TFRT.IN) %>% #Getting rid of -NY.GDP.PCAP.KD, -SE.PRM.NENR and -SP.DYN.TFRT.IN
mutate(region=countrycode(country, origin='country.name', destination='region')) #Extracting region from country name and creating a new column called 'region'
#Cleaning
mortality_clean <- mortality %>%
group_by(country) %>%
summarize(
    startyear=min(year), #Extracting the minimum year as start year
    endyear=max(year)) %>% #Extracting the maximum year as end year
right_join(mortality, by='country') %>% #Joining mortality data set with the summarized table
mutate(
startmort=if_else(year == startyear, SH.DYN.MORT, 0), #new column called 'startmort'
endmort=if_else(year == endyear, SH.DYN.MORT, 0)) %>% #new column called 'endmort'
  filter(startmort > 0 | endmort > 0) %>% #Keeping rows where startmort > 0 or endmort > 0
select(country, region, startmort, endmort) %>% #Extracting the 4 columns
group_by(country, region) %>%
summarise(
startmort=max(startmort), #maximum mortality at the beginning
endmort=max(endmort) #maximum mortality at the end
) %>%
mutate(change=(startmort-endmort)/startmort*100) %>% #Creating a new column called 'change' to see how much mortality rate has changed over the years
group_by(region)
mortality_clean %>%
  slice_max(order_by=change, n=5) #Extracting the top 5 per region
## # A tibble: 32 × 5
## # Groups: region [7]
## country region startmort endmort change
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Korea, Rep. East Asia & Pacific 112. 3.4 97.0
## 2 Singapore East Asia & Pacific 47.7 2.7 94.3
## 3 Japan East Asia & Pacific 39.7 2.7 93.2
## 4 Thailand East Asia & Pacific 146. 10.3 93.0
## 5 China East Asia & Pacific 118. 9.9 91.6
## 6 Portugal Europe & Central Asia 114. 3.6 96.9
## 7 Turkey Europe & Central Asia 257. 12.1 95.3
## 8 Italy Europe & Central Asia 51.9 3.4 93.4
## 9 Cyprus Europe & Central Asia 38.2 2.6 93.2
## 10 Poland Europe & Central Asia 65.1 4.7 92.8
## # … with 22 more rows
mortality_clean %>%
  slice_min(order_by=change, n=5) #Extracting the lowest 5 per region
## # A tibble: 32 × 5
## # Groups: region [7]
## country region startmort endmort change
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Micronesia, Fed. Sts. East Asia & Pacific 56.5 32.5 42.5
## 2 Korea, Dem. People's Rep. East Asia & Pacific 35.1 20 43.0
## 3 Palau East Asia & Pacific 36.4 19.1 47.5
## 4 Nauru East Asia & Pacific 74.1 33.9 54.3
## 5 Tuvalu East Asia & Pacific 80.5 26.4 67.2
## 6 Monaco Europe & Central Asia 9.7 3.4 64.9
## 7 Turkmenistan Europe & Central Asia 133. 42.2 68.2
## 8 Slovak Republic Europe & Central Asia 21.6 6.1 71.8
## 9 Ukraine Europe & Central Asia 33.8 9.2 72.8
## 10 Moldova Europe & Central Asia 64.1 15.3 76.1
## # … with 22 more rows
worldbank_data_pretty_much_long %>%
mutate(
region=countrycode(country, origin='country.name', destination='region'), #Extracting region from country name and creating a new column called 'region'
schoolSkip=100-SE.PRM.NENR) %>% #New column for inverted school enrollment called 'schoolSkip'
ggplot(aes(x=SP.DYN.TFRT.IN, y=schoolSkip)) + #Scatterplot for fertility and inverted school enrollment
geom_point() +
geom_smooth(se=F) + #Adding a smooth line
facet_wrap(~region) + #Creating different graphs for each region
labs(x="Fertility rate", y="School non-enrollment rate", title="Relationship between fertility rate and school non-enrollment by region") + #Labeling x-axis and y-axis
  NULL
Answer below
There is a strong positive relationship between school non-enrollment and fertility rate in South Asia and in Latin America and the Caribbean. This is not the case in developed regions such as Europe, where most countries have lower fertility and higher school enrollment rates.
url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"
# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2021-08-23T14%3A32%3A29/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20210912%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20210912T125015Z&X-Amz-Expires=300&X-Amz-Signature=72d07879242f104dae8f4deab29d9b69c30c5005f00c68d368b40b49422762a7&X-Amz-SignedHeaders=host]
## Date: 2021-09-12 12:51
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 173 kB
## <ON DISK> /var/folders/w9/djpqfftx73j_gmj24_lb0bwm0000gn/T//Rtmp934yxY/filea72dfbdddb.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
sheet = "Data",
range = cell_cols("A:B"))
# change dates to get year, month, and week
bike <- bike0 %>%
clean_names() %>%
rename (bikes_hired = number_of_bicycle_hires) %>%
mutate (year = year(day),
month = lubridate::month(day, label = TRUE),
          week = isoweek(day))
knitr::include_graphics("../images/tfl_distributions_monthly.png", error=F)
Answer below
The grid above shows a large decrease in bike rentals in May and June 2020 compared to previous years. This decrease is almost certainly due to COVID-19 lockdowns, since people had to stay at home. We can also see that May and June vary somewhat from year to year, which most likely reflects weather conditions in those two months (e.g. if May 2018 is warmer than May 2017, there will be more bike rentals in 2018).
knitr::include_graphics("../images/tfl_monthly.png", error=F)
# Clean the data
bike_exp <- bike %>%
  filter(year > 2015) %>% #Keep only data after 2015
group_by(month) %>%
summarise(expected_rentals=mean(bikes_hired)) # Calculate the expected rentals
# Replicate the first graph of actual and expected rentals for each month across years
plot <- bike %>%
filter(year > 2015) %>%
group_by(year, month) %>%
summarise(actual_rentals=mean(bikes_hired)) %>% # Calculate the actual mean rentals for each month
inner_join(bike_exp, by='month') %>% # Combine the data with original dataset
mutate(
up=if_else(actual_rentals > expected_rentals, actual_rentals - expected_rentals, 0),
down=if_else(actual_rentals < expected_rentals, expected_rentals - actual_rentals, 0)) %>% # Create the up and down variable for plotting the shaded area using geom_ribbon
ggplot(aes(x=month)) +
geom_line(aes(y=actual_rentals, group=1), size=0.1, colour='black') +
geom_line(aes(y=expected_rentals, group=1), size=0.7, colour='blue') + # Create lines for actual and expected rentals data for each month across years
geom_ribbon(aes(ymin=expected_rentals, ymax=expected_rentals+up, group=1), fill='#7DCD85', alpha=0.4) +
geom_ribbon(aes(ymin=expected_rentals, ymax=expected_rentals-down, group=1), fill='#CB454A', alpha=0.4) + # Create shaded areas and fill with different colors for up and down side
facet_wrap(~year) + # Facet the graphs by year
theme_bw() + # Theme
labs(title="Monthly changes in TfL bike rentals", subtitle="Change from monthly average shown in blue and calculated between 2016-2019", x="", y="Bike rentals") +
NULL
ggsave(file='bike1_plot.png', plot=plot, width=12, height=8) # Create and save the plot
knitr::include_graphics("bike1_plot.png", error=F)
knitr::include_graphics("../images/tfl_weekly.png", error=F)
# Clean the data
bike_exp_week <- bike %>%
filter(year > 2015) %>%
  mutate(week=if_else(month == 'Jan' & week == 53, 1, week)) %>% # Reassign ISO week 53 dates that fall in January to week 1
group_by(week) %>%
summarise(expected_rentals=mean(bikes_hired))
# Make the graph
plot <- bike %>%
filter(year > 2015) %>%
mutate(week=if_else(month == 'Jan' & week == 53, 1, week)) %>%
group_by(year, week) %>%
summarise(actual_rentals=mean(bikes_hired)) %>%
inner_join(bike_exp_week, by='week') %>%
mutate(
actual_rentals=(actual_rentals-expected_rentals)/expected_rentals, #Calculate the excess rentals
up=if_else(actual_rentals > 0, actual_rentals, 0),
down=if_else(actual_rentals < 0, actual_rentals, 0), # Create the up and down variable for plotting the shaded area using geom_ribbon
colour=if_else(up > 0, 'G', 'R')) %>% # Define the colors for up and down side
ggplot(aes(x=week)) +
geom_rect(aes(xmin=13, xmax=26, ymin=-Inf, ymax=Inf), alpha=0.005) +
geom_rect(aes(xmin=39, xmax=53, ymin=-Inf, ymax=Inf), alpha=0.005) + # Add shaded grey areas for the according week ranges
geom_line(aes(y=actual_rentals, group=1), size=0.1, colour='black') +
geom_ribbon(aes(ymin=0, ymax=up, group=1), fill='#7DCD85', alpha=0.4) +
geom_ribbon(aes(ymin=down, ymax=0, group=1), fill='#CB454A', alpha=0.4) + # Create shaded areas and fill with different colors for up and down
geom_rug(aes(color=colour), sides='b') + # Plot rugs using geom_rug
scale_colour_manual(breaks=c('G', 'R'), values=c('#7DCD85', '#CB454A')) +
facet_wrap(~year) + # Facet by year
theme_bw() + # Theme
labs(title="Weekly changes in TfL bike rentals", subtitle="% change from weekly averages calculated between 2016-2019", x="week", y="") +
NULL
ggsave(file='bike2_plot.png', plot=plot, width=12, height=8) # Create and save the plot
knitr::include_graphics("bike2_plot.png", error=F)
Should you use the mean or the median to calculate your expected rentals? Why?
We use the mean to calculate the expected rentals: the sample mean is the natural estimator of the expected value, and the daily rental counts are not so skewed that outliers distort it. The median would be preferable only if extreme outliers dominated the data.
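On roughly symmetric daily counts the two estimators barely differ, which can be checked on a small made-up sample (the numbers below are illustrative, not actual TfL figures):

```r
# Made-up daily rental counts, roughly symmetric around 25,500
rentals <- c(24000, 26000, 25000, 27000, 25500, 24500, 26500)

mean(rentals)   # 25500
median(rentals) # 25500
```

Only if the data contained extreme outliers (e.g. a day with near-zero rentals due to a transport strike) would the median give a noticeably more robust estimate.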